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Abstract 

We report on scaling and tinning tests of the SIESTA electronic structure code for ab initio nnolecular dynamics simulations 
using density-functional theory. The tests are performed on six large-scale supercomputers belonging to the PRACE Tier-0 
network with four different architectures: Cray XE6, IBM BlueGene/Q, BullX, and IBM IDataPlex. We employ a systematic 
strategy for simultaneously testing weak and strong scaling, and propose a measure which is independent of the range of 
number of cores on which the tests are performed to quantify strong scaling efficiency as a function of simulation size. We 
find an increase in efficiency with simulation size for all machines, with a qualitatively different curve depending on the 
supercomputer topology, and discuss the connection of this functional form with weak scaling behaviour. We also analyze 
the absolute timings obtained in our tests, showing the range of system sizes and cores favourable for different machines. 
Our results can be employed as a guide both for running SIESTA on parallel architectures, and for executing similar scaling 
tests of other electronic structure codes. 



Citation: Corsetti F (2014) Performance Analysis of Electronic Structure Codes on HPC Systems: A Case Study of SIESTA. PLoS ONE 9(4): e95390. doi:10.1371/ 
journal.pone.0095390 

Editor: Danilo Roccatano, Jacobs University Bremen, Germany 

Received February 6, 2014; Accepted IVlarch 26, 2014; Published April 18, 2014 

Copyriglit: © 2014 Fabiano Corsetti. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by CIC nanoGUNE (www.nanogune.eu) and PRACE (www.prace-ri.eu). The funders had no role in study design, data collection 
and analysis, decision to publish, or preparation of the manuscript. 

Competing interests: The author has declared that no competing interests exist. 

* E-mail: f.corsetti@nanogune.eu 



Introduction 

The use of first principles atomistic simulations with density- 
functional theory [1,2] (DFT) has grown from a cottage industry in 
the early 1990s to a routine and integral part of many 
contemporary scientific disciplines, at the meeting point between 
condensed matter physics, physical chemistry, and the new range 
of nanosciences [3,4]. Potential practitioners have a large number 
of ready-made codes to choose from (see, e.g., Refs. [5-16]), which 
distinguish themselves in their licensing models, the range of 
features they offer, the specifics of the technical implementation, 
and, generally, where they lie on the (computational) cost- 
accuracy curve. 

An important consideration for all modern DFT codes is their 
parallel scalability on high-performance computer (HPC) archi- 
tectures, that open up the possibility of simulating very large 
physical systems entirely ab initio. Consequendy, a substantial effort 
has gone into the development and optimization of many of these 
codes for the specific purpose of running on massively parallel 
systems [11,17-28]. Articles describing such developments typi- 
cally illustrate the scaling performance of the code with an 
example of strong scaling [9-27], i.e., the wall time speedup 
obtained for a simulation of fixed size over a range of number of 
cores. Less frequently, weak scaling performance (i.e., an increase 
of the problem size proportionally with the number of cores) is also 
shown [22,25,28]. 

The use of a strong scaling example can be an efiFective way of 
giving a qualitative idea of the parallel efficiency of the code and 
the scale of problems which can realistically be solved with it. 
However, there are a number of issues in using the information as 
it is usually presented for extracting, even approximately, a 



generalized, quantitative measure of performance, such as could 
be used to attempt a comparison between codes. 

Firstly, the range of cores over which this strong scaling is 
investigated is not frxed (as must be the case, since time and 
memory requirements restrict the lower bound, and computation- 
al resources the upper bound). The significance of the demon- 
strated speedup depends crucially on the lower bound of this 
range; furthermore, the dependence is non-trivial. If we assume a 
constant rate of loss of efficiency as the paraUelization is increased, 
a speedup of 3.9 when going from 8 to 32 cores should be better 
than a speedup of 3.8 when going from 2048 to 8192 cores; 
however, this is obviously not the case, as it is clear from 
experience that the actual rate increases significantly with the 
number of cores. Closer comparisons are even harder to judge: is a 
speedup of 3.8 between 512 and 2048 cores better or worse than a 
speedup of 3.7 between 1024 and 4096 cores? There is effectively 
no way to answer this question without making an assumption 
about how to model parallel performance in general. A well- 
known and popular, albeit extremely idealized, way to do is by 
Amdahl's law [29], that describes the overall speedup in terms of 
the paraUelizable fraction of the code P. To the best of our 
knowledge, only one published strong scaling test for a DFT code 
[20] has reported on a fitted value for P. 

Secondly, there is no standard physical system on which to test 
strong scaling. From the point of view of the material itself, this is 
somewhat understandable, as different codes specialize in different 
areas of modelling; a more fundamental problem, however, is that 
strong scaling efficiency changes with system size for a given 
material. Although some studies report system size dependent 
results [9,12], this is generally not the case. How, then, to compare 
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between, e.g., a strong scaling test on a 1532-atom carbon 
nanotube between 2048 and 32768 cores [27], and one on a 1003- 
atom polyalanine peptide between 512 and 65536 cores [23]? 

In this paper, we discuss these issues while reporting on tests of 
the parallel scaling performance of SIESTA [7] , a well-established 
DFT code based on norm-conserving pseudopotentials [30], a 
basis of finite-range numerical atomic orbitals (NAOs), and an 
auxiliary real-space grid for representing the electronic density. 
The tests are performed on six supercomputers (Table 1), currently 
forming the network of Tier-0 systems of the Partnership for 
Advanced Computing in Europe [31] (PEACE). Our aims, 
therefore, are twofold: 
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to give the most up-to-date, comprehensive and reliable results 
of the timing and scaling of SIESTA on modern HPC systems, 
so as to allow users of the code to calculate realistic timing 
estimates over a wide range of number of cores, and therefore 
plan how to make the best use of their computational 
resources; 

to propose a simple framework in which to analyze parallel 
scaling results for all electronic structure codes, arguing in 
particular for the use of Amdahl's law to quantify strong 
scaling performance, and for the importance of investigating 
and reporting this measure as a function of system size. 
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Computational Methods 

Our scaling tests are performed on snapshots of liquid water in 
cubic boxes with periodic boundary conditions. This is the same 
system used previously for parallel benchmarking of the Quickstep 
[9] (CP2K) code; as noted by its authors, liquid water is ideal for 
this purpose, since boxes of any arbitrary number of molecules can 
be created while maintaining the same density and cell shape. 
Furthermore, the lack of crystalline symmetry and the 3D 
periodicity of the material ensure a sufficiently challenging task 
that we expect to give a fair idea of worst-case timings for most 
typical uses of the code, while the presence of a band gap ensures 
that we do not have to worry about convergence issues arising 
during the tests. 

We simultaneously test weak and strong scaling (again similarly 
to Ref [9], by varying both the number of cores, from 32 to 4096 
(Nc = 2",5<n< 12), and the number of water molecules per core, 
from 1 to 32 {Nm/Nc = 2",0<n<5). The resulting suite of tests is 
shown in Fig. 1 . The maximum system size tested is of 4096 water 
molecules (12288 atoms) for all values oi N,„/Nc, except for one 
test of 8192 molecules (24576 atoms) on 8192 cores. We note, 
however, that due to the limited computational time available on 
each machine, not all tests are run on all machines. Weak scaling 
corresponds to moving perpendicular to the N^/Nc axis, while 
strong scaling corresponds to moving diagonally. Instead, system 
size scahng (parallel to the N„,/Nc axis) does not explicitly test 
paraUeUzation, although it is affected by it, as we shall discuss. The 
snapshots for all system sizes are extracted from classical molecular 
dynamics (MD) runs using the TIP4P force field [32] in the 
GROMACS [33] code, equilibrated to 300 K; the cell shape and 
volume are kept fixed at the experimental equilibrium density [34] 
(1.00 g/cm^). 

The tests are performed at the F point only (multiple k points 
being almost embarrassingly parallel), using the semi-local PBE 
[35] functional for exchange and correlation (xc), a 150 Ry cutoff 
energy for the real-space auxiliary grid, and, unless otherwise 
stated, a double-C polarized basis [36] (d^-l-p), corresponding to 
23 NAOs per water molecule; the fraction of occupied eigenstates 
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Figure 1. Three types of scaling that can be investigated by 
systematically varying the number of molecules per core and 
the number of cores. The shaded cells show the suggested set of 
tests to perform on a typical HPC system. 
doi:1 0.1 371 /journal.pone.0095390.g001 

is 4/23 (~ 17%). All system sizes employ 13 self-consistent field 
(SCF) iterations to reach convergence. 

We use the most recent development version of the code 
(siesta-trunk-438), available on the SIESTA website [37]. The 
tests are run with the code's default options for diagonalization, 
employing routines from the ScaLAPACK [38] library: the 
problem is first transformed from generalized to standard form 
by Cholesky factorization with the pdpotrf and pdsygst 
routines, and then the diagonalization itself is performed with 
the pdsyevd divide-and-conquer routine; finally, the back 
transform is performed with the pdtrsm routine. A 2D block- 
cyclic data distribution of the matrices is used, with the matrix 
dimension being an exact multiple of the block size in all cases 
(tests show the ideal block size to be equal to the number of 
orbitals per molecule). 

We choose the standard solver for our tests, as this is currently 
by far the most widely used by the SIESTA community; however, 
we note that several new alternatives are being developed and 
tested: (i) a solver based on the orbital minimization method 
(OMM), which has already been demonstrated to exhibit better 
parallel scaling than explicit diagonalization up to 64 cores [39] 
(available in the development version of the code), (ii) two new 
solvers based on ScaLAPACK, the MRRR algorithm [40] and the 
ELPA library [23] (not yet released), and (iii) a solver based on the 
pole expansion and selected inversion method [41], specifically 
designed for massively parallel architectures (not yet released). 
Finally, the original linear-scaling DFT method implemented in 
SIESTA is also in the process of being redesigned; in its current 
implementation it does not scale well on large clusters. 

The code was compiled on each of the six machines listed in 
Table 1 using the native Fortran compiler and optimized linear 
algebra and communication libraries provided by the .system 
administrators. The Intel compiler and MKL library are used for 
Intel-based machines (IBM iDataPlex and BullX architectures), 
the Cray compiler and ACML library for the AMD-based 
machine (Cray XE6 architecture), and the IBM XL compiler 
and ESSE library for the IBM PowerPC-based machines (IBM 
BlueGene/Q^ architecture). The MPI-2 libraries used are as 
follows: IBM MPI for SuperMUC, Open MPI for MareNostrum, 
BuUX MPI for Curie, and MPICH2 for Hermit, JUQUEEN and 
FERMI. 



Strong scaling 

As previously mentioned, Amdahl's law provides a simple model 
of strong scaling. It states that 
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where Si is the speedup obtained on T^c cores with respect to a 
serial run, t\ and /jv, are the total execution times in serial and on 
Nc cores, respectively, and 5=1— P is the fraction of the code 
that is not paraUelizable (we prefer using 5 instead of the more 
usual P, as the former tends to zero in the limit of ideal scaling). 
Since it is usually not possible in practice to measure t\ for large 
systems, it is useful to define the speedup with respect to a baseline 
number of cores b instead: 
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Using this equation, we can fit our strong scaling data over any 
arbitrary range of number of cores, and obtain a single value 5 
that is in principle independent of this range, and which therefore 
defines the efficiency of the code for any value of Nc as 
1/(1 +5(A^(.— I)). The efficiency is invariantly 100% for a serial 
run, and decreases to zero as A^j.— >oo, since the execution time 
tends to a finite minimum value tiS. 

It is important to note that the conventional interpretation for 5 
and P is necessarily an over-simplification, and should not be 
taken too literally; nevertheless, Amdahl's law qualitatively 
reproduces some universal features of strong scaling, and is 
generally found to provide a good fit to real data. However, such a 
basic one-parameter model can only describe an average scaling 
trend, ignoring any system dependent effects that might favour 
particular values of A^c, e.g., differences in load balancing. Using a 
homogeneous, scalable system such as liquid water and a regular 
grid of tests as shown in Fig. 1 can be effective in minimizing these 
variations, and therefore help to extract clearer general trends. 

Using our timing tests for SIESTA on the six different 
machines, we can analyze the strong scaling of the code for 
system sizes ranging from 64 to 4096 water molecules; however, 
we restrict our fitting of 5 to systems with at least four data points 
(>256 molecules). As a representative example. Fig. 2 shows the 
speedup obtained on SuperMUC (IBM iDataPlex architecture) for 
four different system sizes, together with the curve fitted from Eq. 
2. The resulting 5 values are robust to fitting over different ranges 
(within an order of magnitude), as is the trend of decreasing 5 with 
increasing system size. It is worth noting that this example clearly 
illustrates the difficulty in comparing between scaling tests using 
different ranges of number of cores: despite the steady increase in 
efficiency revelead by the 5 values, the speedups shown in the plots 
appear extremely similar due to the different baselines used. 

The top panel of Fig. 3 summarizes the strong scaling results 
obtained for all six machines: the fitted value of 5 is given as a 
function of system size (i.e., the number of water molecules), for 
systems between 256 and 4096 molecules. Tests for smaller 
systems (32, 64, and 128 molecules) and larger ones (8192 
molecules) are not represented, as there are insufficient data points 
for a reliable fit. 



PLOS ONE I www.plosone.org 



3 



April 2014 I Volume 9 | Issue 4 | e95390 



Performance Analysis of the SIESTA Code on HPC Systems 



256 H2O molecules 
(S= 1.9x10"') 



X! 
0) 
CD 
Q. 

C/3 




512 H2O molecules 



(S = 6.4x10-') 



64 128 192 256 

Num. cores 

1024 H2O molecules 
(S = 2.7x10-') 





128 256 384 512 

Num. cores 

2048 H2O molecules 
(S= 1.3x10') 




256 512 768 1024 

Num. cores 



512 1024 1536 2048 

Num. cores 



Figure 2. Strong scaling on SuperMUC for four different system 
sizes. The full black lines gives the ideal scaling relative to the smallest 
system size. The fit to Amdahl's law is shown by the dashed black line, 
and the corrsponding S value is given above the plot. 
doi:1 0.1 371 /journal.pone.0095390.g002 

For all HPC systems, S is observed to decrease with the size of 
the physical system being simulated. This should not be surprising, 
as it is reasonable to expect efficiency to be related to the number 
of matrix elements/ core (which in turn determines the ratio of 
intracore to intercore operations), and, hence, that the larger the 
system being simulated, the larger the number of cores on which 
the calculation can be performed before the efficiency drops below 
a given threshold. However, the detailed form of this decrease 
depends on many factors related to the nature of the operations 
being performed and the computational architecture, and is 
therefore strongly dependent on the code and the HPC system 
used. 

We can see some interesting distinctions in the S(N,„) curves for 
the six machines. There is a very close agreement between the 
three machines implementing torus topologies (Hermit, JUQU- 
EEN, FERMI), despite Hermit being quite distinct from 
JU QUEEN and FERMI in most other respects, e.g., architecture 
type (Cray XE6 for the former, IBM BlueGene/ Q_ for the latter) 
torus dimension, processor type and speed, number of cores per 
node and amount of memory per core. Instead, the three 
machines implementing fat tree topologies (Curie, SuperMUC, 
MareNostrum), even though they do not exhibit the same level of 
agreement amongst each other, give consistently higher S values 
than the torus machines. 

Furthermore, despite the limited data available, our results 
suggest a qualitatively different form of the decrease of S with A',,, 
for machines with torus and fat tree topologies. The former show 
an approximately linear decrease with slope B on the log-log scale 
(ScciV^*), while the latter exhibit a slowing down of the rate of 
decrease. This is confirmed by fitting the data for each machine to 
a quadratic polynomial on the log-log scale; we find that the 
quadratic coefficient, positive in all cases, is an order of magnitude 
(~3-15 times) smaller for the machines with torus topologies 
compared to those with fat tree topologies. 
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Figure 3. Strong scaling and efficiency. Top panel: S value as a 
function of system size fitted to strong scaling data obtained with 
SIESTA on the six machines; also included are values calculated with 
other DFT codes for a single system size on IBM BlueGene architectures 
(ABINIT: 108 atoms, 11 88 electrons, 3D system, 4 k points [42]; VASP: 87 
atoms, 822 electrons, 2D system, 14 k points [42]; CPMD: 284 atoms, 
1192 electrons, 3D system, k-point sampling unspecified [42]; QE: 1532 
atoms, 5232 electrons, ID system, T point [27,42]; Qbox: 1000 atoms, 
12000 electrons, 3D system, T point [11]). Bottom panel: relationship 
between S and core hour efficiency as a function of the number of 
cores, for four different values of 5* given by the black dashed lines, and 
the fitted values of 5* obtained with SIESTA on four different machines 
for a system of 4096 water molecules; the number of cores at which the 
efficiency is equal to 50% is labelled in each case. 
doi:10.1371/journal.pone.0095390.g003 

Using the S value as a measure of strong scaling, we can 
attempt a quantitative comparison between SIESTA and other 
DFT codes; this is given alongside our results in the top panel of 
Fig. 3. The fits are performed using publicly available scaling test 
data for the codes, published on the website [42] of the FERMI 
IBM BlueGene/ Q_ machine, which we also use for our tests of 
SIESTA; the same data for the Quantum ESPRESSO (QE) code 
has also been published in an article describing development work 
on the code [27]. The only exception is the Qbox code, for which 
we used previously pubhshed tests [1 1] performed on an IBM 
BlueGene/L machine. Where possible, we select tests performed 
F-only. AH codes considered employ a plane-wave basis, in 
contrast to SIESTA's much smaller NAO basis. 

It is important to stress that this comparison serves mainly to 
highlight the inadequacy of the available data; indeed, the change 
in S over more than three orders of magnitude for SIESTA at 
diflFerent system sizes is similar to the range spanned by the results 
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obtained for the other codes, each available at a single system size. 
Both the system size and type vary greatly between codes, from an 
87-atom 2D system for VASP to a 1532-atom ID system for QE. 
Other important factors (k-point sampling, xc functional, basis 
accuracy, code optimization) are also not controlled for. 

Nevertheless, Qbox stands out from all other codes for the 
impressive strong scaling performance demonstrated, with an S 
value more than an order of magnitude lower than that obtained 
by its closest competitor, QE, despite using a smaller system size 
(1000 atoms). Indeed, Qbox has been developed not only for 
massively parallel calculations in general, but specifically for 
running on IBM BlueGene architectures [11,43]; based on these 
results, it is the only DFT code to have demonstrated the potential 
to make efficient use (>50%) of the entirety of a large BlueGene 
machine such as JUQUEEN or FERMI for a single F-point 
calculation. 

System size scaling and absolute timings 

Strong scaling is purely a test of parallel scalability, for which 
the code is, by definition, taken to be 100% efficient when run in 
serial. The results presented so far, therefore, contain no 
information about absolute timings. Although it is convenient to 
separate these two aspects of the code's performance, we should 
remember that the execution time is the only factor of importance 
to the end user. Therefore, strong scaling data on its own can 
sometimes be misleading, as a code that is very fast in serial but 
which exhibits poor strong scaling might nevertheless achieve a 
lower execution time on a medium-sized cluster than one that is 
very slow in serial but with exceptional scalabihty. 

In order to extract a measure of absolute timing from our tests, 
we need to be able to effectively model system size scaling. For a 
conventional DFT code that calculates the eigenvalues and 
eigenvectors of the Kohn-Sham equation [2], either by explicit 
diagonalization (as we do here) or by an iterative minimization 
algorithm, it is well known that the calculation time scales 
cubically with system size (i.e., the number of atoms/molecules/ 
basis orbitals). Linear-scaling methods [44], which make use of 
approximate spatial truncations based on the principle of 
electronic nearsightedness [45], are also now well established 
and have been implemented in a number of popular codes. 

The bottom panel of Fig. 4 shows the results of all timing tests 
performed on two machines, JUQUEEN and SuperMUC; we plot 
the execution time for all number of cores, extrapolated to that of 
a single core as fiv^Si (from Eq. 1), against the system size (the 
number of water molecules Nm)- The estimated speedup Si is 
obtained for any value of N„, by using the fits of 5'(A';„) to the data 
in the top panel of Fig. 3, as described in the previous section. The 
resulting plot very clearly shows an almost pure cubic scaling with 
system size for both machines (the linear fits on the log-log scale 
have a frxed slope of 3). There is an excellent agreement in the 
extrapolated timings for each system size independently of the 
number of cores, and, even more encouragingly, our estimate of 
Si appears to be robust even when extrapolating beyond the 
range of N„, used in the fitting of S. 

From these results, we can justify the use of a basic single- 
parameter model for system size scaling, of the form t\ =aN^j; 
lower-order terms are negligible even for the smallest system sizes 
considered here; this is because all the default routines in SIESTA 
other than the diagonalization procedure itself are linear-scaling 
by design. Within an SCF iteration, the contribution from building 
the sparse Hamiltonian matrix only become comparable to 
diagonalization for very high values of the cutoff energy defining 
the real-space grid, or non-local xc functionals such as those 
including dispersion interactions. We note here that we have also 
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prefactor a for the cubic scaling with system size of the execution time 
in serial for the self-consistent calculation of the liquid water system (1 3 
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absolute timing data, extrapolated for all number of cores to serial 
timings using Amdahl's law and a fitted analytical expression of the 
strong scaling performance as a function of system size. 
doi:10.1371/journal.pone.0095390.g004 

analyzed the strong scaling of individual SIESTA modules, finding 
diagonalization to be the bottleneck within an SCF iteration, while 
Hamiltonian construction is very efficient when using the 
parallelization strategy for the grid operations of Sanz-Navarro et 
al. [21] (accessible via the flag -DBSC_CELLXC at compilation). 

The parameter a, obtained by the fits shown in the bottom 
panel of Fig. 4, can therefore be used to compare the speed of the 
various machines, independently of differences in scaling perfor- 
mance. The values of a obtained for all six machines are shown in 
the top panel of Fig. 4. The large variation in a over almost two 
orders of magnitude is a reflection not simply of the machines' 
processor speeds (listed in Table 1), but also of numerous other 
interacting factors, such as the efficiency of the different compilers 
and libraries. In general, torus machines, which exhibit the best 
scaling, are predicted to be the slowest in serial, while fat tree 
machines, which do not scale as well in parallel, are predicted to 
be the fastest. 

We can now calculate a rough estimate of the execution time on 
each machine for any number of water molecules on any number 
of cores, by using our fits of the function S{Nm) and the parameter 
a, and, hence, buUd up a N„,-Nc 'hase diagram' of the machine 
with the lowest execution time. This is shown in Fig. 5: the main 
panel compares real timings, while the inset uses the estimates 
based on our fits. The agreement is best for large system sizes and 
number of cores, with some discrepancies appearing for N„, < 256; 
this is not surprising, due both to the extrapolation of S{N„) to 
low values, and the fact that the timings are very close for more 
than one machine. 

The machines which gives the lowest absolute timings over the 
entire tested range of Nc are overwhelmingly those with fat tree 
topologies, despite their inferior strong scaling performance with 
respect to torus machines. Two large regions can be clearly 
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dol:1 0.1 371 /journal.pone.0095390.g005 

identified: Curie (BuUX architecture) is the fastest machine for 
simulations with Nc<\2S, while SuperMUC (IBM iDataPlex 
architecture) is the fastest above this value. There is some 
indication, confirmed by the model, that for large system sizes 
{N„, :i 4096) MareNostrum (IBM iDataPlex architecture) becomes 
faster than both of these machines (this might seem surprising, 
since it has the lowest value of a, and, hence, should be the fastest 
in serial at all system sizes; however, it also exhibits the worst 
parallel scaling, making it less efficient than other machines for 
parallel calculations on even very few cores at modest system sizes). 
It is only for extremely large A^,. that the qualitatively different 
decrease in S{Nn,) of the torus machines is predicted to lead to the 
lowest absolute timings, in particular for Hermit (Cray XE6 
architecture), as it has a significantiy lower a value than the IBM 
BlueGene/ machines. 

Our fitted models for the six supercomputers can also be used in 
a broader context, to estimate the execution time for any typical 
SIESTA calculation on HPC systems similar to the ones tested 
here. In fact, since the timing is dominated by the diagonalization 
procedure, especially for large system sizes, we can base our 
estimation on only two parameters, the total number of basis 
orbitals and the number of SCF iterations; we can safely neglect, to 
a first approximation, other parameters such as the number of 
electrons and the number and type of ions. This is illustrated in 
Fig. 6, in which we compare calculations using the standard dC + p 
basis to ones using a larger q( + dp basis [46] (twice the number of 
NAOs per water molecule) over a wide range of number of cores; 
as can be seen, timings on a given number of cores depend only on 
the total number of basis orbitals, and so a calculation using the 
larger basis takes the same time as one using the smaller basis with 
twice the system size. We note that this simple behaviour is due to 
the use of a solver which computes all eigenvalues by explicit 
diagonalization. Instead, solvers based on iterative minimization 
techniques (typically employed by plane-wave codes) scale only 
quadraticaUy with the number of basis functions [39]; for such 
codes, we would expect the dependence oi S{N„,) on basis size to 
be non-trivial. Unfortunately, we are not aware of published data 
for any other DFT code that could help in investigating this issue. 
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Figure 6. Timing comparison on Curie for two different SIESTA 
basis sets. Each data point plots the execution time of a particular 
system size simulated with the dC + p basis (23 NAOS/H2O molecule) 
against that of a different system size simulated with the q4' + dp basis 
(46 NAOS/H2O molecule), chosen so that the two systems have the 
same total number of basis orbitals. The two system sizes are shown In 
brackets (dC + p/qC + dp); In each case, both simulations are performed 
on the same number of cores, equal to the number of molecules In the 
q4' + dp system. 

dol:10.1371/journal.pone.0095390.g006 

In order to allow SIESTA users to obtain absolute timing 
estimates for their parallel calculations, we have released a web 
applet [47] based on the model we have described and the 
quantitative data obtained from our scaling tests. We also include 
a version of the applet for offline use in the Supporting 
Information (Code SI); details of the fits and the final set of 
parameters for the six machines can be easily found within the 
code. 

Weak scaling 

Finally, we briefly discuss the weak scaling behaviour demon- 
strated by the code. Weak scaling is of most interest to linear- 
scaling DFT codes [22,25], for which the objective is to obtain a 
constant time-to-solution as the problem size is increased together 
with the number of cores (this is also known as Gustafson's law 
[48]). Cubic-scaling codes, instead, can achieve at best a quadratic 
weak scaling behaviour, which is rarely investigated [28,39]; 
nevertheless, it can provide useful information on the limits of 
efficiency of the code. 

We find it convenient to plot the execution time divided by the 
square of the number of cores, so that ideal weak scaling behaviour 
will appear flat, analogously to the case of a linear-scaling code. A 
representative example for one machine, JU QUEEN, is shown in 
Fig. 7. Surprisingly, we observe better than ideal weak scaling, 
tending towards ideal as the number of cores is increased. The 
effect becomes more pronounced as the number of water 
molecules per core is decreased. These trends are almost perfecdy 
reproduced by the timing estimates provided by our combined 
modelling of strong scaling and system size scaling. 

We can vmderstand this behaviour as a change in efficiency (as 
defined in the bottom panel of Fig. 3) due to the interplay between 
the decrease of 5 with N„, and the increase of A^c- Similarly, it is 
interesting to note that system size scaling at a fixed number of 
cores > 1 deviates from its ideal cubic behaviour in serial. 

If we assume S{Nm) to be of the form AN,^^, it is easily 
verified from the model that the weak scaling behaviour will tend 
towards ideal for B>\; in the case of JUQUEEN, the fit gives a 
value of 1.8. This result applies equally to linear- and cubic-scaling 
codes, when using the appropriate definition of ideal weak scaling 
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Figure 7. Weak scaling on JUQUEEN for different numbers of 
water molecules per core. The execution time Is divided by the 
square of the number of cores. The dashed lines show the estimates 
given by the fits of S{N„,) and a. 
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for each; indeed, a strikingly similar behaviour is reported for the 
linear-scaling Conquest code [22]. As noted previously, the three 
machines with fat tree topologies appear to exhibit a slowing down 
in the decrease of S with system size; although this should 
eventually make the weak scaling less than ideal, in practice it is 
not noticeable within the range of cores considered. 

Conclusions 

In this paper, we have investigated the performance of the 
SIESTA code on the six supercomputers of the PRACE Tier-0 
network, currentiy amongst the largest in Europe. We propose a 
systematic investigation of parallel scaling using self-consistent 
calculations of snapshots of liquid water, varying both the number 
of cores on which the simulation is run and the number of water 
molecules per core; the largest simulation performed in our tests is 
of 8192 molecules on 8192 cores. 

The results are analyzed using Amdahl's law to fit the data for 
each system size, providing a quantitative estimate of the code's 
efficiency over all number of cores based on a single parameter S; 
the scaling performance of the code, therefore, is completely 
described by the curve of S as a function of system size. We fmd a 
qualitative difference in this curve depending on the topology of 
the connections between nodes in the supercomputer, with 
machines implementing torus topologies demonstrating a better 
scalability to large system sizes than those implementing fat tree 
topologies. Despite this, however, the latter group is shown to give 
lower absolute timings for almost all simulations within the tested 
range, as the performance on individual cores is significantly 
faster; furthermore, such architectures tend to offer a larger 
amount of memory per core, which can become an important 
issue either when running on few cores, or as the size of the 
simulation is increased (the memory requirements scale approx- 
imately quadratically with system size). 



Combining Amdahl's law for strong scaling with a basic one- 
parameter model for system size scaling, both of which are fitted to 
the data provided by our tests, we can calculate a simple estimate 
of the execution time on a given number of cores for a generic 
total energy calculation with SIESTA; a new web applet [47] 
developed in conjunction with the paper allows users of the code to 
employ this model for planning their projects on parallel 
architectures. An estimate of the memory requirement per core 
is also included. 

Throughout the paper we have emphasized potential points of 
comparison with other DFT and electronic structure codes. 
Investigating and reporting S{Nm) curves for different HPC 
systems could provide valuable information to practitioners in the 
field, as well as for the ongoing development of the codes 
themselves. Care must be taken, however, when interpreting the 
results of comparisons based on strong scaling data, due to the 
fundamental differences between codes. Basis sets offer perhaps 
the most important example: is it meaningful to compare the 
strong scaling performance of a localized-orbital code and a plane- 
wave code for the same physical system? It is clear that S varies 
with basis size, and so is crucially dependent in both cases on the 
precision level of the calculation; even disregarding the technical 
challenges involved [46], attempting to equate the two bases is not 
necessarily appropriate, as the codes are designed from the outset 
to be used with difierent aims. For this reason, we suggest that the 
best approach should not be overly competitive; rather, the 
objective should be to report on calculations using the typical setup 
appropriate for each code (e.g., the default dC + p basis for 
SIESTA), or possibly a range of different setups, as this wiU 
provide the most useful information for its users. 

Supporting information 

Code SI Bash script for calculating SIESTA timing 
estimates. These are based on the fits to the data presented in 
the paper. Instructions for using the script are included as a 
comment at the start of the code. The script is also available as a 
web applet [47]. 
(TXT) 
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